NLP

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This paper introduces a new language representation model, BERT (Bidirectional Encoder Representations from Transformers). Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by conditioning on both left and right context in all layers. BERT is the first fine-tuning based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures and setting new state-of-the-art results on 11 NLP tasks.
paper link
code link

Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks, including sentence-level tasks such as natural language inference and paraphrasing, which predict relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and SQuAD question answering, where models are required to produce fine-grained output at the token level.

There are currently two strategies for applying pre-trained language representations to downstream tasks:

  • The feature-based approach
  • The fine-tuning approach

The feature-based approach (e.g., ELMo) uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach (e.g., the Generative Pre-trained Transformer, OpenAI GPT) introduces minimal task-specific parameters and is trained on downstream tasks by simply fine-tuning the pre-trained parameters. In previous work, both approaches share the same objective function during pre-training, using unidirectional language models to learn general language representations.

The authors (researchers at Google AI Language) argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approach. The major limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training. For example, OpenAI GPT uses a left-to-right architecture in which every token can only attend to previous tokens in the Transformer's self-attention layers. Such restrictions are sub-optimal for sentence-level tasks and could be devastating for token-level tasks such as SQuAD question answering, where incorporating context from both directions is crucial.

This paper improves the fine-tuning based approach with BERT (Bidirectional Encoder Representations from Transformers). BERT addresses the unidirectionality limitation mentioned above with a new pre-training objective: the masked language model (MLM), inspired by the Cloze task (Taylor, 1953). The MLM randomly masks some of the tokens in the input, and the objective is to predict the masked tokens based only on their context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer. In addition to the MLM, the authors introduce a "next sentence prediction" task that jointly pre-trains text-pair representations.

The contributions of the paper are as follows:

  • It demonstrates the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), who use unidirectional language models for pre-training, BERT uses the MLM to pre-train deep bidirectional representations. This also contrasts with Peters et al. (2018), who use a shallow concatenation of independently trained left-to-right and right-to-left LMs.

  • It shows that pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.

  • BERT advances the state of the art for 11 NLP tasks. The paper also reports ablation studies of BERT, demonstrating that the bidirectional nature of the model is the single most important new contribution.

Feature-based Approaches

ELMo extracts context-sensitive features from a language model and plugs them into task-specific architectures, yielding improvements on many NLP tasks. Note that in feature-based approaches the pre-trained representations are only added to a task-specific architecture as extra features. ELMo also uses a bidirectional LM, but the two directions are trained independently and their hidden states are merely concatenated at the end, which is fundamentally different from the BERT model proposed in this paper.

Fine-tuning Approaches

The fine-tuning approach (e.g., the Generative Pre-trained Transformer, OpenAI GPT) introduces minimal task-specific parameters and is trained on downstream tasks by simply fine-tuning the pre-trained parameters.

OpenAI GPT replaces the LSTM with a Transformer network as the language model to better capture long-range linguistic structure. During supervised fine-tuning on a specific task, the language modeling objective is kept as an auxiliary training objective. Experiments on 12 NLP tasks achieved state-of-the-art results on 9 of them.

OpenAI GPT uses the standard language modeling objective, i.e., predicting the current word from the previous k words, but as the underlying architecture it uses the Transformer decoder proposed by the Google team in "Attention Is All You Need".
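
For reference, this is the standard left-to-right language modeling likelihood; in the notation of the GPT paper, with an unlabeled corpus $\mathcal{U} = \{u_1, \ldots, u_n\}$, a context window of size k, and Transformer parameters $\Theta$:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$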

For supervised fine-tuning on specific NLP tasks, unlike ELMo, which treats the pre-trained representations as features, OpenAI GPT does not need to build a different model structure for each task. Instead, a softmax output layer is attached to the last layer of the Transformer language model, and the whole model is then fine-tuned. The paper finds that keeping the language model as an auxiliary objective improves the generalization of the supervised model and accelerates convergence.

Since the inputs of different NLP tasks differ, the inputs to the Transformer are also formatted differently per task (see the sketch below). For classification tasks, the text is fed in directly; for textual entailment, the premise and hypothesis are concatenated with a delimiter (Delim) token; for text similarity, the two texts are concatenated with the delimiter in both orders and both sequences are fed in; for multiple-choice question answering, each candidate answer is concatenated with the context.
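
A minimal sketch of these input transformations, assuming the inputs are already token lists; the special token names `<s>`, `<delim>`, `<e>` are illustrative placeholders rather than GPT's actual vocabulary entries:

```python
# Illustrative sketch of GPT-style input transformations for different task types.
# Token names <s>, <delim>, <e> are placeholders, not GPT's actual special tokens.

def classification_input(text):
    return ["<s>"] + text + ["<e>"]

def entailment_input(premise, hypothesis):
    return ["<s>"] + premise + ["<delim>"] + hypothesis + ["<e>"]

def similarity_inputs(text_a, text_b):
    # both orderings are encoded; their representations are combined downstream
    return [
        ["<s>"] + text_a + ["<delim>"] + text_b + ["<e>"],
        ["<s>"] + text_b + ["<delim>"] + text_a + ["<e>"],
    ]

def multiple_choice_inputs(context, answers):
    # one sequence per candidate answer, scored independently
    return [["<s>"] + context + ["<delim>"] + answer + ["<e>"] for answer in answers]
```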

On many tasks, OpenAI GPT performs better than ELMo.

BERT

Model Architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has recently become ubiquitous and this implementation is effectively identical to the original, the paper omits an exhaustive background description of the model architecture and instead refers readers to existing guides such as "The Annotated Transformer".

In this work, the number of layers (i.e., Transformer blocks) is denoted as L, the hidden size as H, and the number of self-attention heads as A. In all experiments, the feed-forward/filter size is set to 4H, i.e., 3072 for H=768 and 4096 for H=1024. Results are reported primarily for two model sizes (a rough parameter-count check follows the list):

  • BERTBASE: L=12, H=768, A=12, Total Parameters=110M

  • BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M
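
As a quick sanity check on these sizes, the following back-of-the-envelope count (a rough sketch: layer-norm and pooler parameters are ignored, and a ~30,000-token WordPiece vocabulary is assumed) lands close to the reported totals:

```python
# Back-of-the-envelope parameter count for BERT-style models (rough sketch:
# ignores layer-norm and pooler parameters; vocabulary assumed ~30,000 WordPieces).

def approx_bert_params(L, H, vocab=30000, max_pos=512, n_segments=2):
    embeddings = (vocab + max_pos + n_segments) * H
    attention = 4 * (H * H + H)                    # Q, K, V and output projections (+ biases)
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H)    # two linear layers with filter size 4H
    return embeddings + L * (attention + ffn)

print(f"{approx_bert_params(L=12, H=768):,}")     # ~108M, close to the reported 110M
print(f"{approx_bert_params(L=24, H=1024):,}")    # ~333M, close to the reported 340M
```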

For comparison, BERTBASE was chosen to have the same model size as OpenAI GPT. Crucially, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which every token can only attend to the context to its left. Note that in the literature the bidirectional Transformer is often referred to as a "Transformer encoder", while the left-context-only version is referred to as a "Transformer decoder", since it can be used for text generation. Figure 1 compares BERT, OpenAI GPT, and ELMo.

Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.

Input Representation

The model input is a single piece of text or a text pair (e.g., [Question, Answer]); note that a "text" here is not necessarily a single linguistic sentence but may be a contiguous span of several sentences.
The input representation is constructed as follows:

  • We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. We denote split word pieces with ##.

  • We use learned positional embeddings with supported sequence lengths up to 512 tokens.

  • The first token of every sequence is always the special classification embedding ([CLS]). The final hidden state (i.e., output of Transformer) corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.

  • Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned sentence A embedding to every token of the first sentence and a sentence B embedding to every token of the second sentence.

  • For single-sentence inputs we only use the sentence A embeddings.

Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings, and the position embeddings.
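
A minimal sketch of this input representation (the real model also applies layer normalization and dropout after the sum, which is omitted here; the sizes are the BERTBASE defaults):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Sum of token, segment (sentence A/B), and learned position embeddings."""

    def __init__(self, vocab_size=30000, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)    # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_len, hidden)      # learned, not sinusoidal

    def forward(self, token_ids, segment_ids):             # both: [batch, seq_len]
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions).unsqueeze(0))   # [batch, seq_len, hidden]
```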

Pre-training Tasks

Task #1: Masked LM

Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself” in a multi-layered context.

To train a deep bidirectional representation, the paper takes a straightforward approach: mask some percentage of the input tokens at random, and then predict only those masked tokens. This procedure is referred to as a "masked LM" (MLM), although it is often called a Cloze task in the literature (Taylor, 1953). The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all experiments, 15% of the WordPiece tokens in each sequence are masked at random. In contrast to denoising auto-encoders (Vincent et al., 2008), only the masked words are predicted rather than reconstructing the entire input.

This approach has two downsides:

  1. Masking words with [MASK] during pre-training creates a mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning;

  2. Only 15% of the tokens in each sequence are predicted rather than the whole input, so pre-training converges more slowly. On this second point, the authors argue that the empirical improvements of the MLM far outweigh the slower convergence.

To mitigate the first issue, the chosen words are not always replaced with the actual [MASK] token: 10% of the time they are replaced with a random word, and 10% of the time the original word is kept unchanged (a code sketch of the full rule follows the list below).

Rather than always replacing the chosen words with [MASK], the data generator will do the following:

  • 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy -> my dog is [MASK]

  • 10% of the time: Replace the word with a random word, e.g., my dog is hairy -> my dog is apple

  • 10% of the time: Keep the word unchanged, e.g., my dog is hairy -> my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
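
A sketch of this masking rule applied to a WordPiece-tokenized sequence; special tokens are never masked, and `vocab` stands in for the 30,000-token WordPiece vocabulary:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Apply the 80/10/10 masking rule; labels are None for positions that are not predicted."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or random.random() >= mask_rate:
            continue
        labels[i] = token                        # the model must recover the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"                 # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(vocab)     # 10%: replace with a random word
        # else: 10% of the time keep the token unchanged
    return inputs, labels

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "store", "went"]))
```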

Task #2: Next Sentence Prediction

Many important downstream tasks, such as question answering (QA) and natural language inference (NLI), are based on understanding the relationship between two text sentences, which is not directly captured by language modeling. To train a model that understands sentence relationships, a binarized next sentence prediction task is pre-trained; such examples can be trivially generated from any monolingual corpus. Specifically, when choosing sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus.

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]

Label = NotNext

We choose the NotNext sentences completely at random, and the final pre-trained model achieves 97%-98% accuracy at this task. Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.
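
A sketch of how such examples could be generated from a monolingual corpus; here a corpus is assumed to be a list of documents, each a list of tokenized sentences with at least two sentences, and the MLM masking above would be applied afterwards:

```python
import random

def make_nsp_example(documents):
    """Build one next-sentence-prediction example with a 50/50 IsNext / NotNext split."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)          # position of sentence A
    sent_a = doc[idx]
    if random.random() < 0.5:
        sent_b, label = doc[idx + 1], "IsNext"    # the actual next sentence
    else:
        sent_b, label = random.choice(random.choice(documents)), "NotNext"  # random sentence
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return tokens, label
```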

Pre-training Procedure

The pre-training corpus consists of the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is <= 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.

The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood.
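
Roughly, for a single training sequence (using notation not taken from the paper), with $M$ the set of masked positions, $\tilde{x}$ the corrupted input sequence, and $y \in \{\text{IsNext}, \text{NotNext}\}$, the loss is

$$\mathcal{L} = -\frac{1}{|M|} \sum_{i \in M} \log P\big(x_i \mid \tilde{x}\big) \;-\; \log P\big(y \mid \tilde{x}\big)$$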

Fine-tuning Procedure

For sequence-level classification tasks, a fixed-dimensional representation of the input sequence is obtained by taking the final Transformer hidden state corresponding to the [CLS] token as the sequence representation; denote this vector by $C \in \mathbb{R}^H$. The only new parameters added during fine-tuning are those of a softmax classification layer.

For span-level and token-level prediction tasks, the above procedure must be modified slightly in a task-specific manner. Details are given in the corresponding subsection of Section 4.

All of the parameters of BERT and the softmax classification layer are fine-tuned jointly to maximize the log-probability of the correct label. Most model hyperparameters stay the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs.
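
A minimal sketch of this classification setup; `bert_encoder` is assumed to be any module that maps token and segment ids to final hidden states of shape [batch, seq_len, H]:

```python
import torch.nn as nn

class BertForSequenceClassification(nn.Module):
    """BERT encoder plus the single new softmax classification layer used for fine-tuning."""

    def __init__(self, bert_encoder, hidden=768, num_labels=2):
        super().__init__()
        self.bert = bert_encoder
        self.classifier = nn.Linear(hidden, num_labels)    # the only parameters learned from scratch

    def forward(self, token_ids, segment_ids):
        hidden_states = self.bert(token_ids, segment_ids)  # [batch, seq_len, H]
        cls_vector = hidden_states[:, 0]                   # C in R^H: final state of [CLS]
        return self.classifier(cls_vector)                 # logits; cross-entropy maximizes log P(label)
```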

The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 3, 4

We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.

Comparison of BERT and OpenAI GPT

The authors compare BERT and OpenAI GPT in an attempt to determine whether BERT's improvements really come from the two novel pre-training tasks proposed in the paper, rather than from the other differences in how the two models were trained:

  • GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

  • GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.

  • GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

  • GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

Experiments

Figure 3: Our task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, T_i represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.

Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The "Average" column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERTBASE = (L=12, H=768, A=12); BERTLARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single-task. All results obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised/.

Table 2: SQuAD results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.

Table 3: CoNLL-2003 Named Entity Recognition results. The hyperparameters were selected using the Dev set, and the reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. "LM (ppl)" is the masked LM perplexity of held-out training data.

The authors also run experiments that use BERT as a source of features, the way ELMo is used. As Table 7 below shows, the best feature-based result reaches 96.1%, close to the 96.4% obtained with fine-tuning, demonstrating that BERT is effective for both the feature-based and the fine-tuning approaches.

Table 7: Ablation using BERT with a feature-based approach on CoNLL-2003 NER. The activations from the specified layers are combined and fed into a two-layer BiLSTM, without backpropagation to BERT.
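
A sketch of this feature-based setup; the layer combination (concatenation of several layers' activations) and the tag count are illustrative, and the key point is that BERT's activations are detached so no gradients flow back into the encoder:

```python
import torch
import torch.nn as nn

class FeatureBasedTagger(nn.Module):
    """Two-layer BiLSTM tagger over frozen BERT layer activations, as in the Table 7 setup."""

    def __init__(self, hidden=768, n_layers_used=4, num_tags=9):
        super().__init__()
        self.lstm = nn.LSTM(hidden * n_layers_used, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)

    def forward(self, layer_activations):                  # list of [batch, seq_len, hidden] tensors
        features = torch.cat([a.detach() for a in layer_activations], dim=-1)
        output, _ = self.lstm(features)                    # no backpropagation into BERT
        return self.classifier(output)                     # per-token NER logits
```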

Conclusion

Compared with traditional word vectors, language model pre-training can be viewed as producing sentence-level, context-dependent word representations: it can exploit large-scale monolingual corpora and can model polysemy. The paper shows that pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.

Reference